Corpus-Based Evaluation of Prosodic Phrase Break Prediction Using nltk_lite’s Chunk Parser to Detect Prosodic Phrase Boundaries in the Aix-MARSEC Corpus of Spoken English

Authors

  • Claire Brierley
  • Eric Atwell
Abstract

An automatic phrase break prediction system aims to identify prosodic-syntactic boundaries in text which correspond to the way a native speaker might process or chunk that same text as speech. In computational linguistics, machine learning from hand-annotated corpus data has become the de facto standard approach to text annotation problems such as prosodic annotation. Phrase break prediction is treated as a classification task, and output predictions from language models are evaluated against “gold standard” prosodic phrase break annotations in a speech corpus. Despite the application of rigorous metrics such as precision and recall, the evaluation of phrase break models is still problematic because prosody is inherently variable: a given linguist’s morphosyntactic analysis and prosodic annotations for a given text may not be fully representative of the range of parsing and phrasing strategies available to, and exhibited by, native speakers. A fairer approach to evaluation requires POS-tagged and prosodically annotated variants of a text to enrich the gold standard and enable more robust ‘noise-tolerant’ measurement of language models. We report on experiments with the Aix-MARSEC spoken English corpus. This corpus has already been richly annotated at several linguistic levels, allowing a range of features to be applied in machine learning of phrase break prediction. We have developed a rule-based prosodic phrase break predictor, which can be used to enrich the phrase break mark-up, expanding it from a single linguist’s analysis to include a wider range of possible interpretations of the text. This allows different predictions to score well if the prosody is plausible, even if the predicted phrase breaks differ from the corpus linguist’s analysis.

Prosodic phrasing is the means by which speakers of any given language break up an utterance into meaningful chunks. The term ‘prosody’ itself refers to the tune or intonation of an utterance, and therefore prosodic phrase breaks literally signal the end of one tune and the beginning of another. This study uses phrase break annotations in the Aix-MARSEC Corpus of spoken English as a “gold standard” for measuring the degree of correspondence between prosodic phrases and the discrete syntactic grouping of prepositional phrases, where the latter is defined via a chunk parse rule using nltk_lite’s regular expression chunk parser. A three-way comparison is also introduced between “gold standard”, chunk parse rule and human judgement in the form of intuitive predictions about phrasing. Results show that even with a discrete syntactic grouping and a small sample of text (around 1400 words), problems arise for this rule-based method due to uncategorical behaviour in parts of speech. Lack of correspondence between intuitive prosodic phrases and corpus annotations highlights
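
As a rough illustration of the kind of rule-based comparison the abstract describes, the sketch below groups prepositional phrases with a regular expression over POS tags, predicts a phrase break before each group, and scores the predictions against gold-standard breaks with precision and recall. It is not the authors' implementation and does not use nltk_lite. The Penn-style tags, the rule itself, the "break before each prepositional phrase" convention, the function names and the toy gold annotation are all assumptions made for the example.

```python
import re

# Hypothetical chunk rule (not the paper's): a prepositional phrase is a
# preposition, an optional determiner, any adjectives, then one or more nouns.
# Penn-style POS tags are assumed here for illustration only.
PP_RULE = re.compile(r"<IN>(<DT>)?(<JJ>)*(<NN[A-Z]*>)+")


def chunk_pps(tagged):
    """Return (start, end) token spans of the chunks matched by PP_RULE."""
    tag_string = ""
    offsets = {}  # character offset in tag_string -> token index
    for i, (_, tag) in enumerate(tagged):
        offsets[len(tag_string)] = i
        tag_string += f"<{tag}>"
    offsets[len(tag_string)] = len(tagged)
    return [(offsets[m.start()], offsets[m.end()])
            for m in PP_RULE.finditer(tag_string)]


def predict_breaks(tagged):
    """Assumed convention: predict a break just before each PP chunk.

    A break is identified by the index of the token that follows it.
    """
    return {start for start, _ in chunk_pps(tagged) if start > 0}


def precision_recall(predicted, gold):
    """Score predicted break positions against gold-standard positions."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy sentence with invented tags and an invented gold break; none of this
# is taken from the Aix-MARSEC corpus.
tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN"),
          ("in", "IN"), ("the", "DT"), ("old", "JJ"), ("house", "NN")]
gold_breaks = {6}

predicted = predict_breaks(tagged)
precision, recall = precision_recall(predicted, gold_breaks)
print(sorted(predicted), precision, recall)
```

In this toy case the rule predicts a break before both prepositional phrases while the invented gold annotation has only one, so recall is perfect but precision is not. This is the kind of mismatch between a categorical syntactic rule and a single annotator's phrasing that the abstract discusses.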

Similar resources

Prosody resources and symbolic prosodic features for automated phrase break prediction

It is universally recognised that humans process speech and language in chunks, each meaningful in itself. Any two renditions or assimilations of a given sentence will exhibit similarities and discrepancies in chunking, where speakers and readers use pauses and inflections to mark phrase breaks. This thesis reviews deterministic and stochastic approaches to phrase break prediction, plus dataset...

Phrase-boundary translation model using shallow syntactic labels

The phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target-side phrases of the training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...

Automatic, model-based detection of pause-less phrase boundaries from fundamental frequency and duration features

Prosodic phrase boundaries (PBs) are a key aspect of spoken communication. In automatic PB detection, it is common to use local acoustic features, textual features, or a combination of both. Most approaches – regardless of features used – succeed in detecting major PBs (break score “4” in ToBI annotation, typically involving a pause) while detection of intermediate PBs (break score “3” in ToBI ...

Prosodic Phrase Break Prediction: Problems in the Evaluation of Models against a Gold Standard. (Prédiction des frontières prosodiques entre syntagmes : le problème de l'évaluation des modèles à l'aide d'un corpus de référence)

The goal of automatic phrase break prediction is to identify prosodic-syntactic boundaries in text which correspond to the way a native speaker might process or chunk that same text as speech. This is treated as a classification task in machine learning and output predictions from language models are evaluated against a ‘gold standard’: human-labelled prosodic phrase break annotations in transc...

Prosodic Phrase Break Prediction: Problems in the Evaluation of Models against a Gold Standard

The goal of automatic phrase break prediction is to identify prosodic-syntactic boundaries in text which correspond to the way a native speaker might process or chunk that same text as speech. This is treated as a classification task in machine learning and output predictions from language models are evaluated against a ‘gold standard’: human-labelled prosodic phrase break annotations in transc...

Publication date: 2007